Abstract: This paper presents a streaming approach to solve
the  truth  estimation  problem  in  crowdsourcing  applications. 
We  consider  a  category  of  crowdsourcing  applications  where 
a  group  of  individuals  volunteer  (or  are  recruited  to)  share 
certain  observations  or  measurements  about  the  physical  world. 
Examples  include  reporting  locations  of  gas  stations  that  remain 
operational  after  a  natural  disaster  or  reporting  locations  of 
potholes  on  city  streets.  We  call  such  applications  social  sensing. 
Ascertaining  the  correctness  of  reported  observations  is  a  key 
challenge  in  such  applications,  referred  to  as  the  truth  estimation 
problem.  This  problem  is  made  difficult  by  the  fact  that  the 
reliability  of  individual  sources  is  usually  unknown  a  priori,  since 
any  concerned  citizen  may,  in  principle,  participate.  Moreover, 
the  timescales  of  crowdsourcing  campaigns  of  interest  can  be 
as  small  as  a  few  hours  or  days,  which  does  not  offer  enough 
history  for  a  reputation  system  to  converge.  Instead,  recent  prior 
work,  including  our  own,  developed  fact-finding  algorithms  to 
solve  this  problem  by  iteratively  assessing  the  credibility  of 
sources  and  their  claims  in  the  absence  of  reputation  scores.  Such 
algorithms,  however,  operate  on  the  entire  dataset  of  reported 
observations  in  a  batch  fashion,  which  makes  them  less  suited 
to  applications  where  new  observations  arrive  continually.  In 
this  paper,  we  describe  a  streaming  fact-finder  that  recursively 
updates  previous  estimates  based  on  new  data.  The  recursive 
algorithm  solves  an  expectation  maximization  (EM)  problem  to 
determine  the  odds  of  correctness  of  different  observations.  We 
compare  the  performance  of  our  recursive  EM  algorithm  to  a 
batch EM algorithm, as well as to several state-of-the-art fact-finders
through  extensive  simulations.  We  also  demonstrate  convergence 
of  the  recursive  algorithm  to  the  results  of  the  batch  version 
through  a  real  social  sensing  experiment.  Our  evaluation  shows 
that  the  proposed  approach  can  process  data  streams  much  more 
efficiently  while  keeping  the  truth  estimation  accuracy  close  to 
that  of  the  (much  slower)  batch  algorithm.  Ours  is  therefore 
the first fact-finder developed with explicit consideration of the
continuous update needs of crowdsourcing applications.

Index  Terms — real-time,  truth  discovery,  recursive  expectation 
maximization,  streaming  data,  social  sensing 

I.  Introduction 

This  paper  presents  a  recursive  fact-finding  solution  to  the 
truth estimation problem in social sensing. By social sensing, we
refer to a broad set of crowdsourcing applications, where
individuals  volunteer  or  are  recruited  to  collect  data  about  the 
physical  environment.  For  example,  they  may  report  events 
of  mutual  interest  or  download  a  cell-phone  application  to 
perform  specific  sensor  data  collection  and  sharing  tasks.  Due 
to  the  potentially  unreliable  nature  of  such  unvetted  human 


sources  (and  the  potential  problems  with  their  sensors,  if  used), 
a  key  challenge  in  social  sensing  applications  is  to  assess  the 
likelihood  of  correctness  of  reported  data.  We  call  it  the  truth 
estimation  problem. 

Reputation  systems  [16]  have  been  successful  at  assessing 
quality  of  providers  (e.g.,  the  reliability  of  data  sources)  when 
the  same  providers  repeatedly  execute  transactions  that  can 
be  scored  by  others.  In  contrast  to  such  scenarios,  we  are 
specifically  interested  in  short-lived  crowdsourcing  campaigns 
(e.g.,  to  support  post-disaster  recovery  and  rescue  missions, 
which  may  last  for  only  a  few  days),  where  anyone  can 
volunteer and where there is not enough history to accumulate
meaningful reputations. For example, consider the
recent severe gas shortage around New York City in the
aftermath of Hurricane Sandy. Social networks, such as Twitter,
carried  tens  of  thousands  of  tweets  on  the  availability  of  gas 
at  different  stations,  but  the  reliability  of  the  corresponding 
tweeters  remained  unknown. 

Fact-finder  algorithms  [25],  [27],  [36]  have  been  proposed 
that  use  unsupervised  machine  learning  techniques  to  assess 
data  reliability  directly  from  multitudes  of  unreliable  claims, 
whose  sources  may  not  have  a  known  history  in  advance. 
The  problem  was  also  explored  in  data  mining  literature  [12], 
[18],  [37],  with  intuitions  tracing  back  to  Google’s  original 
PageRank  [3],  [23].  These  solutions  iteratively  rank  claims 
and  sources  to  jointly  assess  the  reliability  of  both,  without 
requiring sources to explicitly comment on each other’s performance.
Unfortunately, they use batch algorithms, designed
to run on a static dataset. As such, they are not well-suited to
processing streaming data for applications such as crowdsourcing,
where new observations continue to arrive over time. The
batch  algorithms  will  either  need  to  operate  on  a  growing  data 
set  as  new  data  arrive  (which  does  not  scale),  or  ignore  some 
previously  computed  results  and  run  from  scratch  on  a  sliding 
recent  data  window  (which  does  not  exploit  all  available  data). 

In  contrast,  the  main  contribution  of  this  paper  is  to  develop 
a recursive fact-finder, based on expectation maximization
(EM), that operates on new data only, as it arrives, updating
previous truth estimates (i.e., estimates of correctness of reported
data) in a manner that approximates running an optimal
batch  algorithm  on  the  entire  augmented  dataset.  To  the  best  of 
our knowledge, the streaming EM scheme proposed in this paper
is the first on-line fact-finding approach designed to solve
the  truth  estimation  problem  in  social  sensing  applications, 
where  there  is  no  prior  knowledge  on  source  reliability  and 
no  immediate  way  to  verify  the  correctness  of  the  collected 
data.  The  streaming  EM  scheme  is  derived  by  formulating 
an  optimization  problem  (in  the  sense  of  maximum-likelihood 
estimation)  and  approximating  the  optimal  solution  using 
results  from  estimation  theory. 

In our evaluation, we study the performance of the new recursive
EM scheme by comparing it to a previously-proposed
batch EM-based fact-finder [36] and several other state-of-the-art
fact-finders [18], [25], [37] through extensive simulations.
The recursive algorithm is shown to have a better performance
trade-off between estimation accuracy and algorithm execution
time than all baselines. We also evaluate the performance
of the recursive EM scheme through a real social sensing
application. The results demonstrate convergence of the recursive
algorithm in quality to results of the corresponding
(optimal  but  much  slower)  batch  EM  algorithm  if  run  on  the 
entire  data  set.  The  results  of  this  paper  are  important  because 
they  allow  social  sensing  applications  to  estimate  data  quality 
and  participant  reliability  from  streaming  data  on  the  fly, 
even  in  short-lived  crowd-sourcing  campaigns  with  no  prior 
information  on  participants. 

Finally, it is pertinent to note components that fall outside
the scope of this work. First, we restrict this work to
improving the data processing algorithm on the back-end.
The mechanisms used on the front-end for data collection
from participants are out of scope. For example, a cell-phone
application might be used to report participants’ observations.
We also do not address security as part of this work and
do not claim the system to be attack-proof. Instead, we
simply content ourselves with assuming that mechanisms
are in place to increase the cost of identity, Sybil and other
attacks,  and  that  the  volunteer  participants  in  our  applications 
(e.g.,  post-disaster  rescue)  are  generally  well-meaning  and 
have  no  incentive  to  disrupt  operation.  For  example,  phone 
companies  already  keep  track  of  identities  of  individual  phones 
(e.g.,  for  billing  purposes),  which  we  can  leverage  to  identify 
unique  sources.  Finally,  we  assume  that  campaign  participants 
operate  individually.  Hence,  to  a  first  degree  of  approximation, 
reports  from  different  sources  may  be  considered  conditionally 
independent. 

With  these  caveats,  the  rest  of  the  paper  is  organized  as 
follows:  We  briefly  go  over  the  model  of  the  truth  estimation 
problem  in  Section  II  and  propose  the  recursive  EM  algorithm 
in  Section  III.  Evaluation  results  are  presented  in  Section  IV. 
We  then  review  the  related  work  in  Section  V  and  conclude 
the  paper  in  Section  VII. 

II.  Truth  Estimation  in  Social  Sensing 

Social  sensing  addresses  the  challenge  of  estimating  some 
pertinent “state of the world” from reports by human sources.
In  this  paper,  we  model  the  state  of  the  world  by  a  set  of 
true/false  statements  (e.g.,  “The  Golden  Gate  bridge  is  on 
fire”,  “The  435  Main  Street  gas  station  is  out  of  power”,  or 


“The  5th  Avenue  and  34th  Street  intersection  is  flooded”). 
Such  a  binary  approach,  while  simple,  is  a  powerful  tool 
to  articulate  arbitrarily  complex  conditions.  It  is  also  well- 
suited  to  geotagging  campaigns  that  mark  locations  of  some 
conditions  of  interest  (e.g.,  locations  of  street  flooding  after  a 
thunderstorm). For example, each location may be associated
with  a  number  of  Booleans  indicating  the  presence  or  absence 
of  different  types  of  damage.  A  report  from  a  source  conveys 
one  or  more  claims,  each  presenting  the  value  of  one  of  these 
Booleans.  The  “ground  truth”  state  is  unknown  and  needs 
to  be  reconstructed  as  accurately  as  possible  from  claims  by 
different  sources,  whose  reliability  is  unknown. 

More formally, consider a social sensing application model,
where a group of $M$ participants (sources), $S_1, \ldots, S_M$,
collectively make observations about $N$ measured Boolean
variables, $C_1, \ldots, C_N$, which are of interest to the application.
We  assume,  without  loss  of  generality,  that  the  “normal” 
state  of  each  (Boolean)  variable  is  negative  (e.g.,  a  place 
is  not  damaged).  Hence,  participants  only  report  when  the 
positive  state  of  the  measured  variable  (repair  is  needed)  is 
encountered.  Each  source  generally  reports  only  a  subset  of 
the  variables  (e.g.,  those  at  the  places  they  have  been  to). 
The  goal  of  truth  estimation  in  social  sensing  is  to  jointly 
calculate  the  reliability  of  participants  (i.e.,  the  probability  that 
a  participant  reports  correct  observations)  and  the  correctness 
of  observations,  given  only  who  reported  what. 

Importantly,  in  crowdsourcing  applications,  the  observations 
from  participants  don’t  come  all  at  once.  Instead,  updates  are 
reported  over  the  course  of  the  campaign,  lending  themselves 
better  to  the  abstraction  of  a  data  stream  arriving  from  the 
community  of  sources.  In  our  previous  work,  we  developed 
a  batch  EM  (expectation  maximization)  algorithm  to  solve 
the  truth  estimation  problem  based  on  a  maximum  likelihood 
estimation  hypothesis  [36].  As  its  name  suggests,  the  batch 
EM  scheme  is  designed  to  run  in  a  batch  mode,  which  is  not 
suitable  for  continuously  arriving  data.  This  is  because,  every 
time  a  new  report  arrives,  the  batch  EM  algorithm  needs  to  be 
re-run  on  the  whole  data  set  from  scratch.  Considering  such 
inefficiency,  this  paper  designs  a  new  fact-finding  approach 
based  on  a  recursive  EM  algorithm  to  update  estimation  results 
on  the  fly  in  view  of  newly  arriving  data. 

Following the terminology of previous work [33]-[36], let
us define the notation used in the following sections.
Let $S_i$ denote the $i$th source and $C_j$ denote the $j$th measured
variable. Let $S_iC_j$ denote whether source $S_i$ reports measured
variable $C_j$. The matrix representing who reported what is
called the observation matrix $X$, where $X_{ij} = 1$ when source
$S_i$ reports that $C_j$ is true, and $X_{ij} = 0$ otherwise. Let $T_j$
represent the ground truth value of $C_j$ (i.e., $T_j$ is 1 if $C_j$ is
true and 0 otherwise). Participant reliability $t_i$ is defined as the
probability that the participant is right in a randomly chosen
measured variable he/she reported. Formally, $t_i$ is defined as:

$$t_i = P(T_j = 1 \mid X_{ij} = 1) \tag{1}$$
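
As an illustration of this notation, the following minimal Python sketch builds the observation matrix from a list of (source, variable) reports and, for the case where ground truth happens to be available, measures each source's empirical reliability in the sense of Equation (1). The function names and the toy data are hypothetical and are not part of the original system.

```python
def build_observation_matrix(reports, num_sources, num_variables):
    """Return X as a list of rows; X[i][j] = 1 if source i claimed variable j is true."""
    X = [[0] * num_variables for _ in range(num_sources)]
    for i, j in reports:
        X[i][j] = 1
    return X


def empirical_reliability(X, truth):
    """Fraction of each source's claims that are actually true.

    Returns None for a source that made no claims (reliability is undefined there).
    """
    result = []
    for row in X:
        claimed = [j for j, x in enumerate(row) if x == 1]
        if not claimed:
            result.append(None)
        else:
            result.append(sum(truth[j] for j in claimed) / len(claimed))
    return result


# Toy data (hypothetical): three sources, three measured variables.
reports = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0)]
X = build_observation_matrix(reports, num_sources=3, num_variables=3)
truth = [1, 1, 0]  # ground truth T_j, normally unknown to the fact-finder
t = empirical_reliability(X, truth)
```

In a real campaign the `truth` vector is not available, which is exactly why $t_i$ has to be estimated jointly with the correctness of the claims.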

Let us also define two more important conditional probabilities:
$a_i$ is the (unknown) probability that source $S_i$ reports
a variable to be true when it is indeed true, and $b_i$ is the
(unknown) probability that source $S_i$ reports a variable to be
true when it is in reality false. Formally, $a_i$ and $b_i$ are defined
as follows:

$$a_i = P(X_{ij} = 1 \mid T_j = 1), \qquad b_i = P(X_{ij} = 1 \mid T_j = 0) \tag{2}$$

The relationship between $t_i$, $a_i$ and $b_i$ can be derived by
Bayes’ theorem:

$$a_i = \frac{t_i \times s_i}{d}, \qquad b_i = \frac{(1 - t_i) \times s_i}{1 - d} \tag{3}$$

where $d$ is the overall background prior that a randomly
chosen measured variable is true. Note, however, that this value
does not indicate whether any particular report about a
specific measured variable is true or not. $d$ can be either chosen
from prior knowledge or jointly estimated in the EM
scheme [36]. Finally, $s_i$ denotes the probability that participant
$S_i$ reports an observation.

Starting with a log-likelihood function that describes the
likelihood of the observed data (i.e., who said what) given
the estimation parameter defined in Equation (2), the batch
EM algorithm converges to the maximum likelihood estimate
of the variables in question (in this case, the truth values
of measured variables and the reliability of sources). The
likelihood function is given by:

$$L = \prod_{j=1}^{N} \left\{ \prod_{i=1}^{M} a_i^{S_iC_j} (1 - a_i)^{(1 - S_iC_j)} \times d \times Z_j + \prod_{i=1}^{M} b_i^{S_iC_j} (1 - b_i)^{(1 - S_iC_j)} \times (1 - d) \times (1 - Z_j) \right\} \tag{4}$$

where $N$ and $M$ are the numbers of measured variables and
sources, respectively, and $Z_j$ is 1 if measured variable $C_j$ is true
(and 0 otherwise). The optimal estimates of the parameters
in the batch EM algorithm [36] are given by:

$$a_i^* = \frac{\sum_{j \in SJ_i} Z_j}{\sum_{j=1}^{N} Z_j}, \qquad b_i^* = \frac{K_i - \sum_{j \in SJ_i} Z_j}{N - \sum_{j=1}^{N} Z_j}, \qquad d^* = \frac{\sum_{j=1}^{N} Z_j}{N} \tag{5}$$

where $SJ_i$ is the set of measured variables the participant $S_i$
actually observes and $K_i$ is its size. $Z_j$ is the probability of
$C_j$ to be true given current estimation and observed data.

In this paper, we design a new streaming fact-finder based
on a recursive EM algorithm to accurately estimate the above
parameters from streaming data.

III. A Recursive Fact-finder

In the following subsections, we derive a recursive formula
for our fact-finder (in Section III-A) then summarize the final
algorithm (in Section III-B).

A. The Derivation

In estimation theory, a recursive formula of the EM scheme
estimates parameters of the model in consecutive time intervals
as follows [30]:

$$\hat{\theta}_{k+1} = \hat{\theta}_k + \left\{ (k+1) I_c(\hat{\theta}_k) \right\}^{-1} \psi(X_{k+1}, \hat{\theta}_k) \tag{6}$$

where $\hat{\theta}_k$ is the estimation parameter obtained by observing the data up
to time interval $k$, $I_c^{-1}(\hat{\theta}_k)$ represents the inverse of the
Fisher information (i.e., the Cramér-Rao lower bound (CRLB))
of the estimation parameter at time $k$, and $\psi(X_{k+1}, \hat{\theta}_k)$ is the
score vector of the observed data at time interval $k+1$ w.r.t.
the estimation parameter $\hat{\theta}_k$. This formula basically provides
us a recursive way to compute the estimation parameter in
the new time interval (i.e., $\hat{\theta}_{k+1}$) based on its estimated
value in the previous time interval (i.e., $\hat{\theta}_k$), the CRLB of the
estimation (i.e., $I_c^{-1}(\hat{\theta}_k)$) and the score vector of the updated
data observed in the new interval (i.e., $\psi(X_{k+1}, \hat{\theta}_k)$). Based
on our previous results of the EM scheme, $\hat{\theta}_k$ is the estimation
vector defined as $\hat{\theta}_k = (\hat{a}_1^k, \hat{a}_2^k, \ldots, \hat{a}_M^k; \hat{b}_1^k, \hat{b}_2^k, \ldots, \hat{b}_M^k)$. $I_c^{-1}(\hat{\theta}_k)$
and $\psi(X_{k+1}, \hat{\theta}_k)$ are given by [35]:

$$\left( I_c^{-1}(\hat{\theta}_k) \right)_{i,j} = \begin{cases} 0 & i \neq j \\ \dfrac{\hat{a}_i^k \times (1 - \hat{a}_i^k)}{N \times d} & i = j \in [1, M] \\ \dfrac{\hat{b}_{i-M}^k \times (1 - \hat{b}_{i-M}^k)}{N \times (1 - d)} & i = j \in (M, 2M] \end{cases} \tag{7}$$

and

$$\psi(X_{k+1}, \hat{\theta}_k)_i = \begin{cases} \sum_{j=1}^{N} Z_j^{k+1} \left[ \dfrac{S_iC_j^{k+1}}{\hat{a}_i^k} - \dfrac{1 - S_iC_j^{k+1}}{1 - \hat{a}_i^k} \right] & i \in [1, M] \\ \sum_{j=1}^{N} (1 - Z_j^{k+1}) \left[ \dfrac{S_{i-M}C_j^{k+1}}{\hat{b}_{i-M}^k} - \dfrac{1 - S_{i-M}C_j^{k+1}}{1 - \hat{b}_{i-M}^k} \right] & i \in (M, 2M] \end{cases} \tag{8}$$

where $Z_j^{k+1}$ is the probability that the $j$th measured variable
is true in time interval $k+1$. Plugging Equations (7) and
(8) into (6), the recursive formula to update the estimation
parameters is given by:

$$\hat{a}_i^{k+1} = \hat{a}_i^k + \frac{1}{N d (k+1)} \left[ \sum_{j \in SJ_i^{k+1}} Z_j^{k+1} (1 - \hat{a}_i^k) - \sum_{j \in \overline{SJ}_i^{k+1}} Z_j^{k+1} \hat{a}_i^k \right]$$

$$\hat{b}_i^{k+1} = \hat{b}_i^k + \frac{1}{N (1 - d) (k+1)} \left[ \sum_{j \in SJ_i^{k+1}} (1 - Z_j^{k+1}) (1 - \hat{b}_i^k) - \sum_{j \in \overline{SJ}_i^{k+1}} (1 - Z_j^{k+1}) \hat{b}_i^k \right] \tag{9}$$

where $SJ_i^{k+1}$ is the set of measured variables reported by $S_i$ in
time interval $k+1$ and $\overline{SJ}_i^{k+1}$ is its complement.
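
The update in Equation (9) can be sketched for a single source as follows. This is a minimal illustration, assuming the $Z_j^{k+1}$ values are already available and taking the inverse Fisher information term for $b_i$ to be $b_i(1 - b_i)/(N(1 - d))$; the function name and argument layout are hypothetical.

```python
def recursive_update(a_prev, b_prev, reported, z, d, k):
    """One recursive step for a single source, following Equation (9).

    reported[j] is 1 if the source reported variable j in interval k+1, else 0.
    z[j] is the probability that variable j is true in interval k+1.
    d is the background prior that a variable is true (0 < d < 1).
    """
    n = len(z)
    grad_a = 0.0
    grad_b = 0.0
    for j in range(n):
        if reported[j]:
            # Reported variables pull a_i up by Z_j and b_i up by (1 - Z_j).
            grad_a += z[j] * (1.0 - a_prev)
            grad_b += (1.0 - z[j]) * (1.0 - b_prev)
        else:
            # Unreported variables pull the estimates down.
            grad_a -= z[j] * a_prev
            grad_b -= (1.0 - z[j]) * b_prev
    a_next = a_prev + grad_a / (n * d * (k + 1))
    b_next = b_prev + grad_b / (n * (1.0 - d) * (k + 1))
    return a_next, b_next
```

Note that the bracketed correction for $a_i$ equals $\sum_{j \in SJ_i} Z_j - \hat{a}_i^k \sum_j Z_j$, so it vanishes exactly when $\hat{a}_i^k$ already equals the batch estimate of Equation (5); the recursion only moves the estimate when the new data disagree with it. The sketch does not clip the results to $[0, 1]$, which a practical implementation would probably want to do.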


From the above equations, we observe that the estimates of the
parameters related to the reliability of each source in the current
time interval can be computed from their estimates in the
past and the observed data in the new interval. Moreover,
$Z_j^{k+1}$ is unknown and can be estimated by its approximation
$\hat{Z}_j^{k+1}$, which can be computed as follows:

$$\hat{Z}_j^{k+1} = f(\hat{a}_i^{k+1}, \hat{b}_i^{k+1}, X_{k+1}) = \frac{A^{k+1} \times d}{A^{k+1} \times d + B^{k+1} \times (1 - d)}$$

where

$$A^{k+1} = \prod_{i=1}^{M} (\hat{a}_i^{k+1})^{S_iC_j^{k+1}} (1 - \hat{a}_i^{k+1})^{(1 - S_iC_j^{k+1})}$$

$$B^{k+1} = \prod_{i=1}^{M} (\hat{b}_i^{k+1})^{S_iC_j^{k+1}} (1 - \hat{b}_i^{k+1})^{(1 - S_iC_j^{k+1})}$$

$$\hat{a}_i^{k+1} = \hat{a}_i^k \times \frac{s_i^{k+1}}{s_i^k}, \qquad \hat{b}_i^{k+1} = \hat{b}_i^k \times \frac{s_i^{k+1}}{s_i^k} \tag{10}$$

where $s_i^{k+1}$ and $s_i^k$ are the probabilities of source $S_i$ to report
a measured variable at time intervals $k+1$ and $k$. For the above
equation to hold, we assume source reliability changes slowly
over time and can be treated as unchanged over two consecutive
time intervals.
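
The two ingredients of Equation (10) can be sketched as below: rescaling a previous estimate by the change in reporting probability, and evaluating the function $f$ for one measured variable. This is a minimal sketch, assuming all $a_i$, $b_i$ and $d$ lie strictly between 0 and 1; the function names are hypothetical, and the products are accumulated in log space only to avoid underflow when $M$ is large.

```python
import math


def predict_rate(prev_estimate, s_prev, s_next):
    """Rescale a previous a_i or b_i by the ratio of reporting probabilities.

    The result can exceed 1 if s_next is much larger than s_prev; a practical
    implementation would need to guard against that.
    """
    return prev_estimate * s_next / s_prev


def posterior_true(a, b, reported, d):
    """Evaluate f for one variable: A*d / (A*d + B*(1-d)).

    a[i], b[i] are the per-source estimates; reported[i] is 1 if source i
    reported this variable in the current interval, else 0.
    """
    log_true = math.log(d)
    log_false = math.log(1.0 - d)
    for a_i, b_i, x in zip(a, b, reported):
        if x:
            log_true += math.log(a_i)
            log_false += math.log(b_i)
        else:
            log_true += math.log(1.0 - a_i)
            log_false += math.log(1.0 - b_i)
    shift = max(log_true, log_false)  # subtract the larger log before exponentiating
    w_true = math.exp(log_true - shift)
    w_false = math.exp(log_false - shift)
    return w_true / (w_true + w_false)
```

For example, with two sources having $a = (0.8, 0.6)$, $b = (0.2, 0.4)$, $d = 0.5$ and only the first source reporting, $A = 0.8 \times 0.4 = 0.32$ and $B = 0.2 \times 0.6 = 0.12$, so the posterior is $0.32 / 0.44 \approx 0.727$.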

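Based on the definition of $\hat{Z}_j^{k+1}$, we can further represent
it as a function of $\hat{a}_i^k$, $\hat{b}_i^k$, $X_k$ and $X_{k+1}$, the values of which we
know at time interval $k+1$:

$$\hat{Z}_j^{k+1} = g(\hat{a}_i^k, \hat{b}_i^k, X_k, X_{k+1}) = \frac{C^{k+1} \times d}{C^{k+1} \times d + D^{k+1} \times (1 - d)} \tag{11}$$

where $C^{k+1}$ and $D^{k+1}$ denote $A^{k+1}$ and $B^{k+1}$ of Equation (10) after
substituting $\hat{a}_i^{k+1} = \hat{a}_i^k \times s_i^{k+1} / s_i^k$ and $\hat{b}_i^{k+1} = \hat{b}_i^k \times s_i^{k+1} / s_i^k$.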


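Plugging Equation (11) into Equation (9), we get the
following recursive computation of the estimation parameters:

$$\hat{a}_i^{k+1} = \hat{a}_i^k + \frac{1}{N d (k+1)} \left[ \sum_{j \in SJ_i^{k+1}} g(\hat{a}_i^k, \hat{b}_i^k, X_k, X_{k+1}) (1 - \hat{a}_i^k) - \sum_{j \in \overline{SJ}_i^{k+1}} g(\hat{a}_i^k, \hat{b}_i^k, X_k, X_{k+1}) \, \hat{a}_i^k \right]$$

$$\hat{b}_i^{k+1} = \hat{b}_i^k + \frac{1}{N (1 - d) (k+1)} \left[ \sum_{j \in SJ_i^{k+1}} \left( 1 - g(\hat{a}_i^k, \hat{b}_i^k, X_k, X_{k+1}) \right) (1 - \hat{b}_i^k) - \sum_{j \in \overline{SJ}_i^{k+1}} \left( 1 - g(\hat{a}_i^k, \hat{b}_i^k, X_k, X_{k+1}) \right) \hat{b}_i^k \right] \tag{12}$$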


Additionally, we can also compute the updated correctness
of measured variables (i.e., $\hat{Z}_j^{k+1}$) as follows:

$$\hat{Z}_j^{k+1} = f(\hat{a}_i^{k+1}, \hat{b}_i^{k+1}, X_{k+1}) \tag{13}$$

where function $f$ is the same as the one in Equation (10).

This gives us the recursive equations to compute the estimation
parameters of our model in the current time interval
based  on  the  estimations  from  the  previous  time  interval  and 
the  observed  data  up  to  now.  Therefore,  we  can  utilize  (12) 
to  keep  track  of  the  estimation  parameter  of  the  sources  that 
report  new  observations  consecutively  over  time.  We  also  note 
that  the  estimation  parameter  change  of  the  updated  sources 
will  affect  the  credibility  of  measured  variables  they  report, 
which  in  turn  will  affect  the  credibility  of  other  sources 
asserting the same measured variable. We call this credibility
update propagation the “ripple effect”. To capture this effect,
we use a simple heuristic: run only one EM iteration after applying
the recursive formula (instead of running the full version
of EM from scratch). This turns out to be efficient
based  on  the  following  observations:  i)  the  recursive  estimation 
already  offers  us  a  reasonably  good  initialization  on  the 
estimation  parameter;  ii)  the  credibility  change  of  sources  by 
a  few  updates  in  a  short  time  interval  is  usually  slight.  This 
allows  the  recursive  EM  to  converge  much  faster  than  the  batch 
algorithm  that  starts  from  a  random  point. 
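
The M-step of that single EM iteration, Equation (5), can be sketched as follows. This is a minimal illustration, assuming the posterior values $Z_j$ have just been recomputed (for instance with the function $f$ of Equation (13)) and that $0 < \sum_j Z_j < N$ so that no denominator is zero; the function name is hypothetical.

```python
def m_step(X, Z):
    """Re-estimate a_i, b_i and d from the posteriors Z, following Equation (5).

    X[i][j] is 1 if source i reported variable j, else 0.
    Z[j] is the current probability that variable j is true.
    Assumes 0 < sum(Z) < len(Z); otherwise a denominator is zero.
    """
    n = len(Z)
    total = sum(Z)
    a, b = [], []
    for row in X:
        k_i = sum(row)                                 # number of variables the source reported
        hit = sum(z for x, z in zip(row, Z) if x)      # expected number of those that are true
        a.append(hit / total)
        b.append((k_i - hit) / (n - total))
    return a, b, total / n
```

Because the recursive step already places the parameters close to their new values, this update changes them only slightly, which is consistent with the observation that one iteration is usually enough.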

B.  The  Final  Algorithm 


Algorithm 1 Recursive Expectation Maximization Algorithm

while new update $X_{k+1}$ arrives do
    for $i = 1 : M$ do
        compute $\hat{a}_i^{k+1}$, $\hat{b}_i^{k+1}$ based on Equation (12)
        update $\hat{a}_i^k$, $\hat{b}_i^k$ with $\hat{a}_i^{k+1}$, $\hat{b}_i^{k+1}$
    end for
    for $j = 1 : N$ do
        compute $\hat{Z}_j^{k+1}$ based on Equation (13)
    end for
    run one EM iteration to capture the “ripple effect”
    Let $Z_j^c$ = the value of $\hat{Z}_j^{k+1}$ after the iteration
    Let $a_i^c$ = the value of $\hat{a}_i^{k+1}$ after the iteration
    Let $b_i^c$ = the value of $\hat{b}_i^{k+1}$ after the iteration
    for $j = 1 : N$ do
        if $Z_j^c > 0.5$ then
            $C_j$ is true
        else
            $C_j$ is false
        end if
    end for
    for $i = 1 : M$ do
        calculate $t_i^*$ from $a_i^c$, $b_i^c$ based on Equation (3)
    end for
    $k = k + 1$
end while


Figure 1. Algorithm Performance versus Number of Participants

In summary, the pseudocode of the recursive EM algorithm derived
above is given in Algorithm 1. The algorithm runs when a new
update $X_{k+1}$ arrives. It first computes the recursive update of
the estimation parameters (i.e., $\hat{a}_i^{k+1}$, $\hat{b}_i^{k+1}$) based on
Equation (12). The correctness of measured variables is
consequently updated from the estimation parameters based on
Equation (13). The recursive algorithm then runs one EM iteration
to capture the “ripple effect” of the credibility propagation, as
discussed in the previous subsection. After that, we decide the
truthfulness of each measured variable $C_j$ at the current time slot
based on the updated value of $\hat{Z}_j^{k+1}$ (i.e., $Z_j^c$). We can also
compute the reliability of each source from the updated values of
$\hat{a}_i^{k+1}$, $\hat{b}_i^{k+1}$ (i.e., $a_i^c$ and $b_i^c$) based on Equation (3).
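
The last two steps of the algorithm can be made concrete with a short sketch. Equation (3) gives $a_i d = t_i s_i$ and $b_i (1 - d) = (1 - t_i) s_i$; adding the two yields $s_i = a_i d + b_i (1 - d)$, and therefore $t_i = a_i d / (a_i d + b_i (1 - d))$. The Python below is a minimal illustration of that inversion and of the 0.5 threshold; the function names are hypothetical.

```python
def rates_from_reliability(t_i, s_i, d):
    """Forward direction of Equation (3): (t_i, s_i) -> (a_i, b_i)."""
    return t_i * s_i / d, (1.0 - t_i) * s_i / (1.0 - d)


def reliability(a_i, b_i, d):
    """Invert Equation (3): recover t_i from a_i, b_i and the prior d.

    Assumes a_i * d + b_i * (1 - d) > 0, i.e. the source reports at all.
    """
    s_i = a_i * d + b_i * (1.0 - d)  # overall reporting probability of the source
    return a_i * d / s_i


def classify(Z, threshold=0.5):
    """Label each measured variable true when its posterior exceeds the threshold."""
    return [z > threshold for z in Z]
```

A quick round trip illustrates the algebra: a source with $t_i = 0.7$, $s_i = 0.3$ and $d = 0.4$ has $a_i = 0.525$ and $b_i = 0.15$, and `reliability(0.525, 0.15, 0.4)` recovers 0.7 up to floating-point error.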

IV.  Evaluation 

In this section, we evaluate the performance of the proposed
recursive EM algorithm compared to the batch EM algorithm
and three state-of-the-art fact-finders, namely, Sums [18],
Average-Log [25] and Truthfinder [37]. For the batch EM algorithm,
there are two ways to initialize the parameters: one is
to statically initialize the estimation parameters based on the
observed data and run EM from scratch (denoted as batch EM-S)
[36], and the other is to use the values computed from
the previous updates for the current initialization (denoted
as EM-P). Below, we first evaluate estimation accuracy and
algorithm execution time through an extensive simulation
study. The recursive EM algorithm is shown to achieve a better
performance trade-off compared to the batch EM algorithm and
other state-of-the-art baselines. Then, we empirically demonstrate
convergence of the recursive EM algorithm to results of the
(optimal but slower) batch EM algorithm through a real-world
social sensing application.


A.  Simulation  Study 

We  begin  by  evaluating  the  performance  of  the  proposed 
recursive  EM  algorithm  in  simulation  by  measuring  (i)  the 
accuracy  of  participant  reliability  estimation,  (ii)  the  false 
positive  and  false  negative  rates  (i.e.,  claims  misclassified  as 
true  or  false),  and  (iii)  the  average  time  the  algorithm  takes  to 
process  an  update  in  different  conditions. 
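
The classification metrics in (ii) can be sketched as follows. The normalization used here (false positives over all truly false variables, false negatives over all truly true variables) is an assumption made for illustration, since the text does not spell out the denominators; the function name is hypothetical.

```python
def error_rates(predicted, truth):
    """Return (false_positive_rate, false_negative_rate).

    predicted[j] is the label assigned to variable j; truth[j] is its ground truth.
    Assumed normalization: FP over truly false variables, FN over truly true ones.
    Requires at least one true and one false variable in `truth`.
    """
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    negatives = sum(1 for t in truth if not t)
    positives = sum(1 for t in truth if t)
    return fp / negatives, fn / positives
```

Other conventions, such as dividing by the total number of variables, would change the absolute values but not the ranking of algorithms on a fixed data set.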

Figure 2. Algorithm Performance versus Participant Chat Rate

We built a simulator in Matlab 7.10.0 that generates a random
number of participants and measured (Boolean) variables.
A random probability $P_i$ is assigned to each participant $S_i$,
representing his/her reliability (i.e., the ground truth probability
that they report correct observations). The “reporting rate” of a
source is defined as the probability that the source reports
an observation at a given time slot, reflecting the source’s
willingness to report. At a given time slot, for each participant
$S_i$, the simulator decides whether or not the participant reports
an observation based on its reporting rate. Each reported
observation from $S_i$ has a probability $t_i$ of being true (i.e.,
reporting the value of a variable correctly) and a probability
$1 - t_i$ of being false. We let $t_i$ be uniformly distributed between
0.5 and 1 in our experiments (see footnote 1). The fact-finder is executed
as reports arrive to update estimates of participant reliability
and truth values of reported data. Each point on the following
curves is an average of 50 experiments.
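
The data generation procedure described above can be sketched in Python as follows. This is not the authors' Matlab simulator; it is a minimal illustration under stated assumptions: every source shares the same reporting rate, a correct report picks a variable uniformly from the true ones and an incorrect report picks uniformly from the false ones, and all names and parameters are hypothetical.

```python
import random


def simulate_stream(num_sources, num_true, num_false, report_rate, num_slots, seed=0):
    """Generate per-slot lists of (source, variable) reports.

    Variables 0..num_true-1 are true; the remaining num_false variables are false.
    Returns the ground-truth reliability of each source and the report stream.
    """
    rng = random.Random(seed)
    # Reliability t_i drawn uniformly between 0.5 and 1, as in the experiments.
    reliability = [rng.uniform(0.5, 1.0) for _ in range(num_sources)]
    true_vars = list(range(num_true))
    false_vars = list(range(num_true, num_true + num_false))
    stream = []
    for _ in range(num_slots):
        updates = []
        for i in range(num_sources):
            if rng.random() < report_rate:            # does source i report in this slot?
                correct = rng.random() < reliability[i]
                pool = true_vars if correct else false_vars
                updates.append((i, rng.choice(pool)))
        stream.append(updates)
    return reliability, stream
```

Each element of the returned stream plays the role of one update $X_{k+1}$ that a streaming fact-finder would consume in turn.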

In  the  first  experiment,  we  evaluated  the  performance  of 
recursive  EM,  the  batch  EM,  and  other  baselines  while  varying 
the  number  of  participants  in  the  system.  The  total  number 
of  reported  variables  was  set  to  2000,  half  of  which  were 
reported  correctly.  The  reporting  rate  of  participants  was  fixed 
at  0.5.  The  number  of  participants  was  varied  from  60  to  150. 
We  simulated  100  time  slots  for  the  data  stream  generation. 
The  observation  updates  of  the  last  20  slots  were  used  to 
evaluate the algorithm performance. Reported results were
averaged over 50 experiments that differ in participant reliability
distributions. Results are shown in Figure 1. Observe that the
recursive  EM  algorithm  takes  the  shortest  time  to  process  an 
update  while  keeping  the  estimation  accuracy  (in  terms  of 
both  participant  reliability  estimation  and  measured  variable 
classification)  very  close  to  the  batch  EM  algorithm. 

The second experiment compares the recursive EM to baseline
algorithms when the source reporting rate changes from
0.1 to 1. Reported results are averaged over 50 experiments.
The results are shown in Figure 2. We observe that the
recursive EM algorithm continues to achieve a better trade-off
between estimation accuracy and execution time: it runs fastest
while offering comparable quality to the batch algorithm.
Note also that both estimation accuracy and execution time
of the studied algorithms improve as the source reporting rate
increases. The reason is that a higher reporting rate leads to
more data, which eventually allows faster convergence of the
algorithm to a more accurate point.

1 In principle, there is no incentive for a participant to lie more than 50%
of the time, since negating their statements would then give a more accurate
truth.

In  the  third  and  last  experiment,  we  examine  the  effect  of 
changing  the  measured  variable  mix  on  the  performance  of  all 
algorithms. We fixed the total number of measured variables
at 2000 and varied the ratio of the number of correctly
reported  measured  variables  to  the  total  number  of  reported 
variables  from  0.1  to  0.6.  The  number  of  participants  is  set 
to  120  and  source  reporting  rate  is  fixed  at  0.5.  Reported 
results  are  averaged  over  50  experiments.  The  results  are 
shown  in  Figure  3.  As  before,  we  observe  that  the  recursive 
EM  algorithm  has  the  shortest  execution  time  and  does  almost 
as  well  as  the  batch  EM  algorithm. 

The  simulation  results  show  that  the  proposed  recursive  EM 
algorithm  succeeds  at  offering  similar  estimation  accuracy  to 
its  best  batch  counterpart  while  running  significantly  faster. 

B.  A  Real  World  Case  Study 

In this section, we evaluate the performance of the proposed
recursive EM algorithm compared to the batch EM algorithm
through a real-world social sensing application. The application
aims to find the free parking lots on the campus of the
University of Illinois at Urbana-Champaign (UIUC). In this
application, “free parking lots” are defined as the parking lots
that are free of charge after 5pm daily. The goal here was
to see if our recursive EM algorithm can track the performance
of the batch EM algorithm and correctly find the real locations
of free parking lots on campus. Specifically, we selected 106
parking lots on campus and asked volunteers to mark the
ones they believed to be “Free”. Participants marked those parking
lots they had been to or were familiar with. We observe that
various types of parking lots exist on campus: enforced parking
lots with time limits, parking meters, permit parking, street
parking, etc. Different parking lots have different regulations
for free parking. Moreover, instructions and permit signs often
read similarly and are easy to miss. Hence, participants are prone to
make mistakes in their marks. For the purpose of evaluation,
we went to the selected parking lots and manually collected
the ground truth.

Figure 3. Algorithm Performance versus Ratio of Correctly Reported Measured Variables

In the experiment, 30 participants were invited to provide
their “free parking lot” marks on the 106 parking lots (46
of which are indeed free). A total of 340 marks were collected
from the participants. We then ran both the recursive and
batch EM algorithms on the collected marks and compared
their performance in identifying the correct free parking lots.
Results are shown in Figure 4. We observe that the recursive
EM algorithm is able to track the performance of the batch
EM algorithm and converges to the number of free parking
lots found by the batch algorithm as the number of marks
used by the algorithm increases. This result confirms, on
real-world data, the convergence property of the recursive EM
algorithm.

It  should  be  emphasized  that  our  choice  of  application  is 
intended  to  be  a  proxy  for  other  more  pertinent  uses  of  our 
fact-finding  tool  that  are  harder  to  experiment  with  in  a  paper 
(due  to  absence  of  ground  truth).  For  example,  “free  parking 
lots”  may  stand  for  “operational  gas  stations”  in  a  post-disaster 
scenario  (such  as  the  New  York  gas  crisis  in  the  aftermath  of 
the  recent  Hurricane  Sandy). 

We should also highlight that we chose an application where
ground truth does not change. This is intentional, in order to
favor our competition (the batch algorithms), which operate on
the entire data set at once and hence have difficulty handling
dynamic changes. We expect the advantages of our recursive
algorithm to be more pronounced if ground truth did change
during the experiment (e.g., a gas station runs out of gas), since
it is easy to adapt the recursive algorithm to give more weight
to more recent measurements. Due to space limitations, we do
not include an evaluation of such scenarios, which are more
favorable to the recursive scheme.
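
The idea of giving more weight to more recent measurements can be illustrated with a minimal Python sketch. It is not the recursive EM update of this paper; it only contrasts a recursive running average that weights all reports equally (gain 1/k) with one that uses a constant gain and therefore discounts old reports exponentially. The function name `recursive_estimate`, the gain value 0.2, and the synthetic report stream are assumptions made for illustration.

```python
def recursive_estimate(observations, gain=None):
    """Return the running estimates of the mean of a 0/1 report stream.

    gain=None uses the decaying gain 1/k, which weights all reports equally.
    A constant gain in (0, 1] discounts older reports exponentially
    (an assumed adaptation, not the update rule derived in the paper).
    """
    estimate = 0.0
    history = []
    for k, x in enumerate(observations, start=1):
        step = 1.0 / k if gain is None else gain
        # Recursive update: combine the previous estimate with the new report.
        estimate += step * (x - estimate)
        history.append(estimate)
    return history


# Hypothetical stream: a gas station is reported open (1) for 50 reports,
# then reported out of gas (0) for the next 50 reports.
stream = [1] * 50 + [0] * 50
equal_weight = recursive_estimate(stream)
recent_weight = recursive_estimate(stream, gain=0.2)
```

With equal weights the final estimate stays at 0.5, halfway between the old and new state, while the constant-gain variant ends close to 0 and so reflects the changed ground truth.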

Finally, we should note that we kept our data sets small
enough that running the batch algorithm upon every
update remained feasible (for evaluation purposes, where each
point needs 50 runs). The real advantage of the recursive
scheme, however, becomes clear when the input volume is
scaled up. For example, hundreds of thousands of tweets may
be received in the aftermath of real disaster events. Interpreting
individual tweets as claims, a recursive fact-finder can rank
the claims by credibility in real time as events unfold, which
would be much less time consuming than re-running a batch
fact-finder continuously as new tweets arrive. Our prior work
presents the results of applying batch fact-finders to Twitter
data [36], a painfully slow experience that motivated this
work.
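
The scaling argument can be made concrete with a rough cost model, sketched below in Python. It only counts how many claims are touched, assuming one claim arrives per update, that a batch re-run touches every claim received so far once per EM iteration, and that the recursive scheme touches only the new claim. The function names and the cost model are assumptions for illustration, not measurements from the paper.

```python
def batch_rerun_cost(num_updates, em_iterations=1):
    """Claims touched when the batch algorithm is re-run after every update.

    Assumed model: the k-th re-run visits all k claims received so far,
    once per EM iteration.
    """
    return sum(k * em_iterations for k in range(1, num_updates + 1))


def recursive_cost(num_updates):
    """Claims touched when each update processes only the new claim."""
    return num_updates


# 340 is the number of marks in the parking-lot study; feeding them one at
# a time is a hypothetical setup chosen only to illustrate the growth rates.
marks = 340
batch_total = batch_rerun_cost(marks)
recursive_total = recursive_cost(marks)
```

Under this model the batch re-runs grow quadratically (57,970 claim visits for 340 marks with a single EM iteration) while the recursive scheme grows linearly (340 visits), and the gap widens quickly at the scale of hundreds of thousands of tweets.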
Figure  4.  Recursive  EM  Algorithm  Convergence 

V.  Related  Work 

Social sensing, which is also referred to as human-centric
sensing [4], [20], is generally achieved by various kinds of
sensors that are closely attached to humans, either in
wearable form or in their mobile devices (e.g., cell phones).
A broad overview of social sensing applications is presented
in [1]. Some early applications include CenWits [14],
CarTel [15], CabSense [29] and BikeNet [8]. More recent work
has focused on addressing new challenges emerging in social
sensing applications such as preserving the privacy of
participants [2], improving the energy efficiency of sensing
devices [24] and measuring the sociability of participants and
strengthening their interactions [22], [28]. Examples include
privacy-aware regression modeling, a data fusion technique that
produces the same model as that computed from raw data by
properly computing non-invertible aggregates of samples [2].
E-Gesture is an energy-efficient gesture recognition architecture
that significantly reduces the energy consumption of mobile
sensing devices while keeping the recognition accuracy
acceptable [24]. SociableSense is a smartphone-based platform
used to measure the sociability of users and foster interactions
among participants by studying their behavior in the office
environment [28]. Nawaz et al. [22] adapted a similar social
sensing system to understand group dynamics and information
flow at building construction sites. Our work complements
past work by addressing truth estimation in social sensing
on the fly.


A  relevant  body  of  work  in  the  machine  learning  and 
data  mining  communities  performs  trust  analysis  based  on 
the  source  and  claim  information  network.  Fact-finders  are 
a  class  of  iterative  trust  analysis  algorithms  that  estimate 
both  the  credibility  of  claims  and  the  trustworthiness  of  the 
sources. Examples include Sums [18], TruthFinder [37], the
Investment, PooledInvestment and Average-Log algorithms [25]
and Bayesian Interpretation [32]. Many fact-finders also
enhance the basic trust analysis models. 3-Estimates [10] rewards
sources that correctly assert highly disputed claims, while
AccuVote [7] considers “source dependence” by effectively
boosting the trustworthiness of independent sources. More
recent work has produced new fact-finding algorithms
designed to handle the domain expertise of information sources,
multi-valued facts of an entity and partially known ground
truth of variables. Kasneci et al. [17] proposed a CoBayes
scheme to learn the affinity between users’ expertise and their
statements by mapping them into a common latent knowledge
space. Zhao et al. [40] presented a Bayesian scheme to model
different types of errors made by sources and merge
multi-valued attributes in data integration systems. Yin et al. [38]
provided a semi-supervised approach to find the true values
with the help of (a small amount of) ground truth data. In
contrast, this paper proposes the first on-line fact-finder to
solve the truth discovery problem in social sensing applications
with explicit consideration of continuous data updates.
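
As a concrete instance of an iterative batch fact-finder, the Python sketch below follows the commonly described form of Sums (Hubs and Authorities): a source's trustworthiness is the sum of the beliefs in its claims, a claim's belief is the sum of the trustworthiness of the sources asserting it, and both are normalized by their maximum in each round. It is a baseline illustration, not the recursive EM algorithm of this paper; the function name, the fixed iteration count, and the toy reports are assumptions.

```python
def sums_fact_finder(claims_by_source, iterations=20):
    """Iterate source trust and claim belief on a bipartite source-claim graph.

    claims_by_source maps each source to the set of claims it asserts.
    Returns (trust, belief), each normalized so that the maximum is 1.0.
    """
    sources = list(claims_by_source)
    claims = sorted({c for cs in claims_by_source.values() for c in cs})
    belief = {c: 1.0 for c in claims}
    trust = {s: 1.0 for s in sources}
    for _ in range(iterations):
        # Source trust: sum of the current beliefs in the claims it asserts.
        trust = {s: sum(belief[c] for c in claims_by_source[s]) for s in sources}
        top = max(trust.values()) or 1.0
        trust = {s: t / top for s, t in trust.items()}
        # Claim belief: sum of the trust of the sources that assert it.
        belief = {
            c: sum(trust[s] for s in sources if c in claims_by_source[s])
            for c in claims
        }
        top = max(belief.values()) or 1.0
        belief = {c: b / top for c, b in belief.items()}
    return trust, belief


# Hypothetical reports: three sources marking parking lots as free.
reports = {
    "s1": {"lot_a", "lot_b"},
    "s2": {"lot_a"},
    "s3": {"lot_a", "lot_c"},
}
trust, belief = sums_fact_finder(reports)
```

Note that every round revisits the full source-claim graph, which is the batch behavior that the streaming approach is designed to avoid.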

Since people are an indispensable element in social sensing,
some common attacks that originate from human (or source)
interactions are worth investigating. A collusion attack is
carried out by a group of colluding attackers who collectively
perform malicious actions to defraud honest sources or
obtain objectives forbidden by the system. This attack could
be mitigated by monitoring the interactions or relationships
among colluding attackers or identifying abnormal behavior
from the group [19]. A Sybil attack is another related attack,
carried out by a single attacker who intentionally creates a
large number of pseudonymous entities and uses them to gain
a disproportionately large influence on the system. This attack
could be mitigated by certifying trust of identity assignment,
increasing the cost of creating identities, limiting the resources
the attacker can use to create new identities, etc. [39]. By
handling reports from colluding or duplicate sources in a way
that takes care of the source dependency, we will be able to
address the above attacks to some extent. For example, by
identifying duplicate sources, we can remove them along with
their reports from the observed dataset, which is expected to
improve the estimation performance. Problems become more
interesting when sources are not just duplicates but actually
linked through the social network [21].
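
The duplicate-removal step mentioned above can be sketched in Python as a simple preprocessing pass. The sketch assumes, purely for illustration, that two sources count as duplicates when they assert exactly the same set of claims; this heuristic, the function name, and the toy reports are not from the paper, and a real deployment would need stronger evidence of duplication (honest sources can legitimately report identical claims).

```python
def drop_duplicate_sources(claims_by_source):
    """Keep one representative per group of sources with identical claim sets.

    Assumed heuristic: identical claim sets indicate duplicate sources.
    Sources are visited in sorted order so the result is deterministic.
    """
    seen_signatures = set()
    kept = {}
    for source in sorted(claims_by_source):
        signature = frozenset(claims_by_source[source])
        if signature in seen_signatures:
            # Drop the duplicate source together with its reports.
            continue
        seen_signatures.add(signature)
        kept[source] = set(claims_by_source[source])
    return kept


# Hypothetical reports: one honest source and three Sybil identities
# that all push the same single claim.
reports_with_sybils = {
    "alice": {"lot_1", "lot_2"},
    "sybil_1": {"lot_9"},
    "sybil_2": {"lot_9"},
    "sybil_3": {"lot_9"},
}
cleaned = drop_duplicate_sources(reports_with_sybils)
```

After this pass the claim pushed by the three Sybil identities is backed by a single source, so it no longer receives disproportionate support in the subsequent estimation.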

Our work also bears resemblance to reputation systems.
Different types of reputation systems are being used
successfully in commercial online applications. For example, eBay
is a type of reputation system based on homogeneous
peer-to-peer systems, which allows peers to rate each other after
transactions [13]. Our scheme may not be directly applicable
to such systems, because the structure of a homogeneous
peer-to-peer system is commonly represented by a mesh network
graph, while the structure of our scheme is represented by a
bipartite network graph (i.e., sources and measured variables
are in disjoint sets). The Amazon on-line review system
represents another type of reputation system, where different
sources offer reviews on products (or brands, companies) they
have experienced [16]. Customers are affected by those reviews
(or reputation scores) in making purchase decisions. Our work
fits better into this type of reputation system and has the
potential to provide more refined and timely results for the
reputation computation.

The  recursive  expectation  maximization  (EM)  algorithm  is 
an  online  version  of  the  EM  algorithm  where  a  statistical 
approximation  procedure  is  applied  to  estimate  the  parameters 
in  a  recursive  and  adaptive  way  [30].  The  recursive  EM  has 
been  used  in  a  wide  range  of  applications  with  large  dynamics 
in sensor networks. For example, Guo et al. developed a
methodology based on the recursive EM algorithm to optimize
sensor deployment and adaptively estimate the boundary of
sensor locations [11]. Frenkel et al. applied the recursive EM
algorithm in a multiple target tracking scenario and achieved
a linear computational complexity with respect to the number
of targets in the system [9]. Chung et al. derived a recursive
EM procedure for direction of arrival (DOA) estimation under
a deterministic model and independent Gaussian noise [5]. In
this paper, we propose a recursive EM algorithm to greatly
reduce the computational overhead of our previous iterative
algorithm, which must run on an ever-growing dataset when
applied to streaming data [36]. To the best of our knowledge,
this is the first on-line algorithm developed to address the
truth discovery challenge in social sensing applications.

VI.  Limitations  and  Future  Work 

This  paper  presented  a  streaming  fact-finding  approach 
to  address  the  truth  estimation  challenge  in  social  sensing 
applications  on  the  fly.  Several  simplifying  assumptions  were 
made  that  offer  directions  for  future  work. 

Participants  (sources)  were  assumed  to  be  independent. 
However,  in  reality,  sources  might  be  non-independent  or 
even  collude  to  mask  the  truth.  For  example,  on  Twitter, 
it  could  be  that  a  large  set  of  individuals  report  the  same 
observation  not  because  they  independently  observed  it  them¬ 
selves,  but  because  they  heard  it  from  a  source  they  trust 
(which  could  in  fact  be  wrong).  Several  techniques  have 
been  recently  developed  to  discover  source  dependencies  and 
copying  relationships  [6],  [7].  Other  approaches  were  shown 
to  efficiently  mitigate  the  source  collusion  attack  by  analyzing 
the  network  or  interaction  patterns  of  colluding  sources  [19]. 
Additionally,  source  dependencies  could  also  be  inferred  from 
the  underlying  social  network.  An  admission  controller  was 
designed  to  select  independent  sources  for  social  sensing 
applications  based  on  a  few  distance  metrics  derived  from  the 
social  network  topology  [31].  It  is  reasonable  to  integrate  the 
above  techniques  with  our  streaming  fact-finding  approach  to 
effectively  handle  source  dependencies  in  the  future. 


In this paper, we did not assume dependencies between
measured variables. However, observations on different
variables may often be correlated. For example, a fire report at
location A in a city might imply traffic congestion at location
B that is a few blocks away from A. Several approaches
have been proposed to take the underlying relations between
measured variables as prior knowledge [25], [26]. Hence,
we can possibly extend the model of the EM scheme to
incorporate dependencies into the likelihood function.
Moreover, all observations are assumed to be equally important
in our model. It would be interesting to extend the current
model to consider the “difficulty” of making different
observations, and give more weight to correct reporting of more
difficult observations. A few techniques have been proposed to
consider the hardness of observations, which may be used
together with our scheme [10]. Additionally, the measured
variables were assumed to be Boolean in this paper. This
assumption is sufficient in many social sensing scenarios, where
the presence or absence of a given condition of interest (e.g.,
litter) can be represented by a Boolean variable. Our model can
be extended to handle other discrete measured variables (e.g.,
weather in a city can be sunny, rainy, or snowy) by expanding
the number of estimation parameters to cover all possible
states of the variable. The general outline of the derivation
still holds. Having a basic streaming fact-finding algorithm in
place, we shall relax the above assumptions and accommodate
the mentioned extensions in future work.

VII.  Conclusion 

This paper described a streaming fact-finding approach
to the truth estimation problem in social sensing that allows
applications to process streaming data efficiently. The
streaming approach is based on a recursive EM algorithm that
computes the estimation parameters by processing only the
newly arrived data and combining the results with previous
estimates. The performance of the streaming fact-finder was
evaluated through extensive simulations. Results show that the
new streaming approach achieves a better trade-off between
estimation accuracy and algorithm execution time than the
batch EM scheme and other baselines. Evaluation data from a
real social sensing application were also presented.